Multi-modal Dependency Tree for Video Captioning

Neural Information Processing Systems

Generating fluent and relevant language to describe visual content is critical for the video captioning task. Many existing methods generate captions using sequence models that predict words in a left-to-right order. In this paper, we investigate a graph-structured model for caption generation that explicitly models the hierarchical structure of sentences to further improve their fluency and relevance. To this end, we propose a novel video captioning method that generates a sentence by first constructing a multi-modal dependency tree and then traversing the constructed tree, where the syntactic structure and semantic relationships in the sentence are represented by the tree topology. To take full advantage of the information from both vision and language, both visual and textual representation features are encoded into each tree node. Different from existing dependency parsing methods that generate uni-modal dependency trees for language understanding, our method constructs multi-modal dependency trees for language generation from images and videos. We also propose a tree-structured reinforcement learning algorithm to effectively optimize the captioning model, where a novel reward is designed by evaluating the semantic consistency between the generated sub-tree and the ground-truth tree. Extensive experiments on several video captioning datasets demonstrate the effectiveness of the proposed method.
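The generation scheme described above, first build a dependency tree and then traverse it to emit the sentence, can be illustrated with a minimal sketch. Everything below (the node fields, the per-node position attribute, and the traversal) is a hypothetical simplification for illustration, not the paper's actual model:

```python
# Hypothetical sketch: linearizing a dependency tree into a sentence.
# Node structure and field names are illustrative assumptions, not the
# paper's implementation (which also encodes visual/textual features per node).
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TreeNode:
    word: str                                   # token predicted at this node
    position: int                               # assumed linear position in the sentence
    children: List["TreeNode"] = field(default_factory=list)

def collect(node: TreeNode) -> List[Tuple[int, str]]:
    """Gather (position, word) pairs from the subtree rooted at `node`."""
    pairs = [(node.position, node.word)]
    for child in node.children:
        pairs.extend(collect(child))
    return pairs

def linearize(root: TreeNode) -> str:
    """Traverse the whole tree, then order words by their sentence positions."""
    return " ".join(word for _, word in sorted(collect(root)))

# A tiny dependency tree for "a man plays a guitar":
# the verb is the root; the subject and object phrases hang off it.
root = TreeNode("plays", 2, [
    TreeNode("man", 1, [TreeNode("a", 0)]),
    TreeNode("guitar", 4, [TreeNode("a", 3)]),
])
print(linearize(root))  # a man plays a guitar
```

The point of the sketch is only that the tree topology, not left-to-right decoding order, determines how the sentence is assembled.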


Supplementary Material for Multi-modal Dependency Tree for Video Captioning

Neural Information Processing Systems

The evaluation results on the Charades Captions dataset are shown in Table 2. Figure 1 presents qualitative results of the generated tree structures and sentences on the MSR-VTT dataset. Table 1 reports the average sentence lengths of the ground-truth captions and of the captions generated by the ablation variants; a lower average edit distance is better. We also provide further details of the human evaluation: we recruited 10 annotators to carry out the evaluation process, and the user interface shown to them appears in Figure 4.

